Block Partitioning Information-Based CNN Post-Filtering for EVC Baseline Profile

The need for efficient video coding technology is more important than ever in the current scenario where video applications are increasing worldwide, and Internet of Things (IoT) devices are becoming widespread. In this context, it is necessary to carefully review the recently completed MPEG-5 Essential Video Coding (EVC) standard because the EVC Baseline profile is customized to meet the specific requirements needed to process IoT video data in terms of low complexity. Nevertheless, the EVC Baseline profile has a notable disadvantage. Since it is a codec composed only of simple tools developed over 20 years, it tends to represent numerous coding artifacts. In particular, the presence of blocking artifacts at the block boundary is regarded as a critical issue that must be addressed. To address this, this paper proposes a post-filter using a block partitioning information-based Convolutional Neural Network (CNN). The proposed method in the experimental results objectively shows an approximately 0.57 dB for All-Intra (AI) and 0.37 dB for Low-Delay (LD) improvements in each configuration by the proposed method when compared to the pre-post-filter video, and the enhanced PSNR results in an overall bitrate reduction of 11.62% for AI and 10.91% for LD in the Luma and Chroma components, respectively. Due to the huge improvement in the PSNR, the proposed method significantly improved the visual quality subjectively, particularly in blocking artifacts at the coding block boundary.


Introduction
The current growth in global video applications, driven by consumer desire for highquality experiences, has expanded the relevance of devices dramatically [1].This spike has resulted in a significant increase in frame rates per second to support natural motion, leading to an increase in video content capacity.According to Cisco statistics [2], video-related traffic accounts for around 80% of overall Internet traffic, highlighting the widespread relationship between data transmission and video content.This trend is not confined to 2D movies; it includes 3D videos, volumetric stereoscopic images, 360-degree videos, and VR/AR material, all of which require more data capacity [3].
At the same time, the increase in Internet of Things (IoT) systems has increased the need for effective video coding technology [4].The significant growth of video data in IoT systems pursues dedicated coding and processing methods.As these systems focus on local data processing for intelligent sensor nodes, the importance of minimizing data volume while ensuring high-quality decoded images becomes important [5].Thus, video coding technology has importance in nodes that extend beyond traditionally used areas, including small edge units of computing.
Traditionally, video compression technology has been developed through standards created by organizations such as the ISO/IEC Moving Picture Experiences Group (MPEG) and ITU-T Video Coding Experiences Group (VCEG).Standards such as MPEG-2/H.262[6], Advanced Video Coding (AVC)/H.264[7], and High Efficiency Video Coding (HEVC)/H.265[8] Sensors 2024, 24, 1336 2 of 15 have contributed significantly to the efficient compression and transmission of video data.Recently, new video coding standards such as Versatile Video Coding (VVC)/H.266[9] and MPEG-5 Essential Video Coding (EVC) [10] have been introduced.While VVC/H.266 was developed jointly by the MPEG and VCEG, EVC is a product exclusively for the MPEG.
In this context of the growth of video data in IoT systems, it is necessary to carefully review the completed EVC standard.In particular, the EVC Baseline profile is customized to meet the specific needs of handling IoT video data.Since this profile aims to build a royalty-free codec using conventional coding techniques that are more than 20 years old, focusing only on performing key functions, avoiding the integration of complex tools, results in high-performance compression even at a low complexity [11].Therefore, it is believed that the EVC Baseline profile proves to be a proper video codec for sensor node networks that require high-performance compression while operating at low power and complexity, and it is expected to play a pivotal role in addressing the growing need for high performance within the IoT ecosystem.
However, the EVC Baseline has a notable drawback.Being a codec comprised solely of simple tools developed over 20 years, it tends to exhibit numerous coding artifacts.Specifically, the presence of blocking artifacts at the block boundaries is considered a critical issue that needs resolution.To address this, a post-filter leveraging block partitioning information-based Convolutional Neural Network (CNN) is introduced in this paper.The proposed filter aims to rectify the challenges associated with the EVC Baseline profile, characterized by a high occurrence of coding artifacts.The proposed post-filter seeks to provide a high-efficiency compression performance and enhanced image quality, making it suitable for node sensor networks with low-complexity requirements.The main contributions of this study can be summarized as follows: (1) A CNN-based post-filter for the EVC Baseline profile was developed, offering a promising video coding solution for IoT devices.(2) An analysis of the major artifacts in the EVC Baseline profile was conducted, and a method indicating the area where these artifacts appear was exploited.(3) The incorporation of a guide map based on blocking partitioning information was implemented to identify attention areas and enhance visual quality in the target image and video.(4) Consideration was given to IoT applications with low complexity, allowing IoT devices to selectively add the post-filter based on the available extra computing power.(5) A scenario-based CNN-based post-processing network was developed for real IoT applications, whether in image-based or real-time broadcasting/streaming services.
The remainder of this paper is organized as follows.Section 2 provides an overview of the EVC Baseline profile, related works of CNN-based filtering technologies, and standard activity.The proposed method is presented in Section 3. Section 4 provides an overall performance evaluation and analysis.Finally, Section 5 concludes this paper.

Related Work
To examine the relevant work of the proposed method, this section initially offers background information on EVC Baseline profiles.Following that, it explores CNN-based filtering for video coding, encompassing both in-loop filtering and out-loop filtering.Finally, it will outline the recent developments in standards for neural network-based video coding at the Joint Video Exploration Team (JVET), a collaboration between the MPEG and ITU-T.

Overview of EVC Baseline Profile
The block structure of the EVC Baseline profile is based on a partitioning method that supports quadtree division based on 64 × 64 blocks.The maximum block size of the coding unit is 64 × 64 and the minimum size is 4 × 4. For intra prediction, the process is performed based on the coding unit block, incorporating five supported prediction modes.The intra prediction supported by the Baseline profile would be impossible to accurately predict direction, but the major intra directionality can be predicted, thereby reducing the redundancy of directional information within a frame.
The residual value generated in the prediction process is converted into a frequency value through Discrete Cosine Transform (DCT), and the converted coefficient value is converted into a quantized coefficient value through a quantization process.The size of the transformation aligns with the size of the prediction block, and a process with a quantization parameter (QP) in the range of 0 to 51 is used for quantization.After the quantization process, the quantized coefficient values are scanned through a zigzag scan order and are then binarized through a basic run-level coding method, and the binarized values are streamed to the entropy coding engine as described in JPEG Annex D [12].In the case of the filtering tool in the Baseline profile, an initial version of the deblocking filter in AVC/H.264 was applied to improve the objective and subjective image quality.The method is the same as that applied to H.263 Annex J [13].
For the EVC Baseline profile, it has been reported that it achieves approximately 30% bit savings in the objective evaluation and about 40% in the subjective evaluation compared to AVC/H.264, which is widely utilized on the Internet, while maintaining the same quality [14].Moreover, in terms of complexity, it exhibits one-fourth of the algorithmic complexity when compared to AVC/H.264, making it a promising candidate as an optimal compression codec for next-generation sensor nodes [15].

CNN-Based Filtering Technologies for Video Coding
To improve coding artifacts during the encoding and decoding process, the latest video coding standard comes equipped with an in-loop filter designed.In the case of the VVC/H.266standard, it incorporates three traditional in-loop filters: Deblocking Filter (DBF), Sample Adaptive Offset (SAO), and Adaptive Loop Filter (ALF).These filters are sequentially applied to the reconstructed frames.The DBF focuses on suppressing blocking artifacts at the block boundaries, while the SAO filter and the ALF aim to eliminate artifacts resulting from quantization.Despite the effectiveness of these filters, there is still considerable room for improvement in terms of visual quality.
Recently, developments have seen an active pursuit of research aimed at minimizing video coding artifacts using neural networks.That research focuses primarily on two aspects: (1) the design of a filter using a neural network for an in-loop filtering method applicable within the codec, similar to the DBF, SAO, and ALF, and (2) the investigation of a post-filter method that can be selectively applied outside the codec as needed.
Park et al. [16] introduced a CNN-based In-Loop Filter (IFCNN) capable of replacing the SAO in HEVC/H.265,and the proposed IFCNN showed a promising coding performance on Bjontegaard Delta bitrate (BD-BR) [17], with reductions of 2.6% and 2.8% for the Random-Access (RA) and Low-Delay (LD) configurations, respectively.Dai et al. proposed a Variable Filter Size Residual Learning Convolutional Neural Network (VRCNN) [18], designed to replace conventional filters in HEVC/H.265,such as the DBF and SAO, in HEVC/H.265.The proposed method in [18] utilized the variable block size of transform in HEVC/H.265;thus, residual learning led to faster convergence.According to [18], the VRCNN reduced the BD-BR by an average of 4.6% in the All-Intra (AI) configuration.Similar to the motivation of earlier methods, Kang et al. introduced a multi-scale CNN (MMS-net) [19] that could replace the DBF and SAO in HEVC/H.265by utilizing skip connections with different scales from subnetworks to enhance the restoration process.The proposed MMS-net's performance on the BD-BR showed a reduction of 8.5% for the AI configuration.Wang et al. [20] proposed an attention-based dual-scale CNN (ADCNN), which utilized the encoding information, such as the QP and partitioning information, and the proposed ADCNN's performance on the BD-BR showed reductions of 6.5% and 2.8% for the AI and RA configurations, respectively.The residual highway CNN (RHCNN) in [21] utilized residual units with a progressive training scheme for the QP bands, and the proposed RHCNN's performance on the BD-BR showed reductions of 5.7%, 4.4%, and 5.7% for the AI, RA, and LD configurations, respectively.Similar to the approach of [21], Wang et al. [22] applied a neural network-based in-loop filter (CNNLF) in the conventional video coding framework in VVC/H.266 by conducting the modules of feature extraction and image quality enhancement.Compared with VTM-15.0, the proposed CNNLF improved the PSNR by 0.4 dB and 0.8 dB at 0.1 Mbps, respectively, and by 0.2 dB and 0.5 dB at 1 Mbps, respectively.Huang et al. [23] also added the CNN-based network to the conventional video coding framework specifically between the DBF and SAO in VVC/H.266.The proposed method based on a variable CNN utilized an attention module into a residual block to extract informative features, and the proposed method in [23] showed reductions of 3.6%, 3.6%, and 4.2% in performance on the BD-BR for the AI, RA, and LD configurations, respectively.
The purpose for post-filtering approaches is similar to that of in-loop filtering; however, it is used outside of the codec architecture.Thus, CNN-based post-filtering algorithms can be selectively applied to decoded images to improve visual quality.Dong et al. [24] introduced a CNN-based artifact removal method (AR-CNN) designed for JPEG compressed images, which was an extension of the super-resolution CNN (SRCNN) from previous studies.The results presented in [24] demonstrated a 1 dB improvement achieved by the proposed AR-CNN when compared to JPEG images.Li et al. [25] presented a method employing a twenty-layer CNN architecture with residual learning.An interesting aspect of the method proposed in [25] involved transmitting side information related to video content complexity and quality indicators from the encoder to the decoder at each frame.The performance of the method, as reported in [25], demonstrated a 1.6% BD-BR reduction compared with HEVC/H.265 on the six sequences given in the 2017 ICIP Grand Challenge.Zhang et al. [26] introduced a post-processing architecture based on a CNN for VVC/H.266compressed video sequences.This architecture utilized 16 identical residual blocks and incorporated three types of skip connections, and it was reported that the proposed method in [26] showed a reduction of 3.9% in performance on the BD-BR for the RA configuration compared to VVC/H.266.The authors extended the [26] method, incorporating a generative adversarial network (GAN)-based training strategy to improve the visual quality of VVC/H.266-decodedimages.The proposed method in [27] showed a notable enhancement in perceptual visual quality, achieving a reduction of 3.9% in performance on the BD-BR for the RA configuration compared to VVC/H.266.Bonnineau et al. [28] introduced a multitask learning-based approach that employed a QP map to generalize the model with various QPs by sharing parameters within a single network and task-specific modules.The method presented in [28] exhibited a significant improvement in perceptual visual quality, achieving a reduction of 2.8% in performance on the BD-BR for the RA configuration compared to VVC/H.266.Wang et al. [29] aimed to enhance the visual quality of decoded images by incorporating partitioning information with QP information, introducing a three-branch network.The method described in [29] demonstrated a notable improvement in perceptual visual quality, achieving a reduction of 6.5% in performance on the BD-BR for the AI configuration compared to VVC/H.266.Meng et al. [30] presented a network for enhancing visual quality, combining temporal motion and spatial information through a fusion subnet and an enhancement subnet.The approach outlined in [30] showed a significant improvement in perceptual visual quality, achieving a 0.29 dB enhancement compared to VVC/H.266-decoded images.

Neural Network-Based Video Coding
Meanwhile, various applications have recently explored the advancement in neural network (NN) technology.For instance, machine learning is leveraged in natural language processing and computer vision to overcome performance barriers.This trend is also making an impact on the development of video coding.The JVET is actively monitoring the adoption of NN technology and has initiated research into Neural Network-based Video Coding (NNVC) [31].During the 130th MPEG meeting and 19th JVET meeting, two independent Ad Hoc Groups (AHGs) related to NNVC were formed, both focusing on the development of (1) an end-to-end (E2E) video coding framework and (2) the integration of NN in a hybrid video coding framework.Subsequently, these two AHGs were consolidated under the JVET, with the merged group tasked with assessing the feasibility of NNVC for potential coding gains compared to traditional video coding standards based on signal processing technology.Currently, the development of in-loop filtering mainly using neural networks is being actively discussed in the JVET.It should be noted that in JVET activities, the main architecture of the network is based on a res-block CNN structure.Considering the fact that video coding generally uses a residual-based encoding/decoding approach that relies on accurate predictions about intra/interframes, the focus for improvement is mainly on preserving the details expressed through content distribution without changing the DC value.Thus, this approach, using the res-block basis CNN architecture, aligns well with the overall architecture of video coding, proving effective for in-loop filtering.

CNN-Based Post-Filtering with Block Partitioning Information
In the previous section, we reviewed the filtering technologies employed in conventional video coding standards and the recently emerged neural network-based filtering methods.While the future outlook for neural network-based filtering technologies appears promising, it is acknowledged that they still present challenges in terms of complexity.Given this context, one might argue that a post-filter, capable of adaptively enhancing image quality as needed, is more practical than an in-loop filter, which must be consistently applied to sensor nodes requiring fast processing with low complexity.Therefore, this paper proposes a CNN-based post-filter for EVC, aiming to enhance the image quality and compression rates while maintaining the constraints of low power and low complexity.

Analysis of Coding Artifacts
The EVC Baseline profile employs a quadtree-based coding structure, allowing the utilization of blocks up to 64 × 64, as illustrated in Figure 1.This method involves determining the optimal block size through processing from 64 × 64 to 4 × 4 in the encoder and transmitting this information to the decoder based on the quadtree.For example, during the decoding process, if the split flag is 0, the coding block for the process is 64 × 64.If the split flag is 1, four additional split flags are transmitted, indicating whether the coding block should be divided into units of 32 × 32.This process continues until the information is transmitted down to 4 × 4, a leaf node.The size of the coding block is determined according to the characteristics of the content, and specifically, the coding block is determined as a large block in homogeneous areas and a small block in delicate areas.Nevertheless, while the EVC block decision process ensures optimal rate-distortion (RD) performance, the absence of high-performance in-loop filtering in the EVC Baseline profile leads to the generation of significant artifacts around the block.Errors in the video coding process include ringing artifacts, blocking artifacts, and bending artifacts.Among these, the most noticeable artifact for video consumers is the blocking artifact, primarily occurring at the block boundaries in block-based video coding.Specifically, in the EVC Baseline profile, the discontinuity at the block boundary is pronounced, leading to a significant degradation in the image quality of the decoded image.Figure 2 shows an example of the result of encoding by the EVC Baseline profile on QP = 37 to the RaceHorses sequence, clearly showing the prominent presence of blocking artifacts at the block boundary.The problem is that the Baseline profile contains an excessive Errors in the video coding process include ringing artifacts, blocking artifacts, and bending artifacts.Among these, the most noticeable artifact for video consumers is the blocking artifact, primarily occurring at the block boundaries in block-based video coding.Specifically, in the EVC Baseline profile, the discontinuity at the block boundary is pronounced, leading to a significant degradation in the image quality of the decoded image.Figure 2 shows an example of the result of encoding by the EVC Baseline profile on QP = 37 to the RaceHorses sequence, clearly showing the prominent presence of blocking artifacts at the block boundary.The problem is that the Baseline profile contains an excessive number of such blocking artifacts.To address this concern, our research aims to improve the visual quality of the decoded images produced by the EVC Baseline profile.This improvement is accomplished by employing a block partitioning strategy within the context of CNN-based post-filtering.Errors in the video coding process include ringing artifacts, blocking artifacts, and bending artifacts.Among these, the most noticeable artifact for video consumers is the blocking artifact, primarily occurring at the block boundaries in block-based video coding.Specifically, in the EVC Baseline profile, the discontinuity at the block boundary is pronounced, leading to a significant degradation in the image quality of the decoded image.Figure 2 shows an example of the result of encoding by the EVC Baseline profile on QP = 37 to the RaceHorses sequence, clearly showing the prominent presence of blocking artifacts at the block boundary.The problem is that the Baseline profile contains an excessive number of such blocking artifacts.To address this concern, our research aims to improve the visual quality of the decoded images produced by the EVC Baseline profile.This improvement is accomplished by employing a block partitioning strategy within the context of CNN-based post-filtering.The architecture in our proposed method is an extension of [32].In the previous work, we utilized QP map information in the context of CNN-based post-filtering targeting the VVC/H.266standard.In this paper, we extended this concept by integrating the block partitioning information into CNN-based post-filtering, specifically targeted for the EVC Baseline profile, which is suitable for video data transmission in sensor nodes.Figure 4 outlines the comprehensive network design employed in our proposed method.The architecture in our proposed method is an extension of [32].In the previous work, we utilized QP map information in the context of CNN-based post-filtering targeting the VVC/H.266standard.In this paper, we extended this concept by integrating the block partitioning information into CNN-based post-filtering, specifically targeted for the EVC Baseline profile, which is suitable for video data transmission in sensor nodes.Figure 4 outlines the comprehensive network design employed in our proposed method.

Architecture and Network
The architecture in our proposed method is an extension of [32].In the previous work, we utilized QP map information in the context of CNN-based post-filtering targeting the VVC/H.266standard.In this paper, we extended this concept by integrating the block partitioning information into CNN-based post-filtering, specifically targeted for the EVC Baseline profile, which is suitable for video data transmission in sensor nodes.Figure 4 outlines the comprehensive network design employed in our proposed method.In the initial processing block, the decoded image from the EVC Baseline profile is combined with a block partitioning map and framed.Both the decoded image from the EVC Baseline profile and the block partitioning map operate based on YUV channels.After concatenation, the number of input channels doubled.This package uses the decoded image from the EVC Baseline profile as the target for improvement, and the block partitioning map guides the areas with artifacts in the target decoded image.
The packaged video is then fed into the head block, comprising a 1 × 1 convolution filter with 128 output channels and a Parametric Rectified Linear Unit (PReLU).The primary function of the head block is to decrease the input dimensions for the subsequent backbone blocks.In this head block, we configured 128 channels to generate multiple feature representations, and the resulting 128 channels undergo the activation function (i.e., PReLU) to apply non-linearity to the output of the head block.
The features extracted from the head block are then directed to the backbone blocks, which encompass multiple blocks focused on extracting features.In the fundamental feature extraction block, a 3 × 3 convolution filter with 128 output channels and a PReLU is In the initial processing block, the decoded image from the EVC Baseline profile is combined with a block partitioning map and framed.Both the decoded image from the EVC Baseline profile and the block partitioning map operate based on YUV channels.After concatenation, the number of input channels doubled.This package uses the decoded image from the EVC Baseline profile as the target for improvement, and the block partitioning map guides the areas with artifacts in the target decoded image.
The packaged video is then fed into the head block, comprising a 1 × 1 convolution filter with 128 output channels and a Parametric Rectified Linear Unit (PReLU).The primary function of the head block is to decrease the input dimensions for the subsequent backbone blocks.In this head block, we configured 128 channels to generate multiple feature representations, and the resulting 128 channels undergo the activation function (i.e., PReLU) to apply non-linearity to the output of the head block.
The features extracted from the head block are then directed to the backbone blocks, which encompass multiple blocks focused on extracting features.In the fundamental feature extraction block, a 3 × 3 convolution filter with 128 output channels and a PReLU is employed.The 3 × 3 convolution filter plays a crucial role in extracting features from the input and generating essential features for the subsequent layers.The 128 output channels from the 3 × 3 convolution filter undergo the PReLU activation function, and this process is repeated through the layers up to the final feature extraction block to ensure convergence in the deeper layers of the network.
In the proposed method described in this paper, we utilized 16 feature blocks based on empirical studies, but this number can be adjusted depending on the characteristics of the input decoded image.To maintain network simplicity, we designed the backbone block with a shape similar to the head block.While the head block primarily reduces the input dimensions, the backbone block focuses on capturing residual features for training.
The tail block, responsible for processing the output channels from the backbone blocks, integrates a 1 × 1 convolution filter with three output channels and employs the Tanh activation function, replacing the PReLU.To achieve precise quality improvement, we chose to update the residuals of the decoded image.As a result, the input decoded image from the head block is connected to the output of the tail block through a skip connection.The residual updates ensure that the primary values of the decoded image remain unaltered, while enabling adjustments to the corrupted areas introduced during the encoding and decoding processes, which represent the core objective of the proposed method.

Training
To create the training dataset, we utilized the BVI-DVC [33] dataset, comprised of 800 videos of varying resolutions ranging from 270 p to 2160 p, providing a diverse set of training data.Given that the BVI-DVC dataset is based on mp4 files, we converted these files to the YUV420 format with 10-bit files for the training dataset using FFmpeg [34].To streamline the dataset creation, we extracted 10 frames from each video, resulting in a training dataset of 8000 frames.Ensuring uniform sizes for each Y, U, and V channel, we upsampled the U and V channels to match the size of the Y channel.Following the conversion, the original YUV format videos were processed through XEVE [35] and XEVD [36] to produce decoded images in the YUV format.Subsequently, instead of utilizing the entire image size for the training dataset, we cropped each image from the original and decoded YUV images to a size of 256 × 256.We then randomly selected the cropped images produced through horizontal and vertical flipping processes.
For the models in the proposed method, we generated five models corresponding to the QP values.Standard groups, such as the MPEG and JVET, use a common test condition (CTC) for experiments to evaluate suggested contributions, usually utilizing four or five QP values.A QP value in a codec plays an important role in this process.An increase in the QP results in higher distortion due to a coarser quantization step applied to transform coefficients with a larger QP.This leads to the loss of high-frequency information and a broader distribution range for the compensation value between the reconstructed and original pixels.Conversely, a low QP value yields better visual quality but requires a relatively high bitrate.Therefore, the QP number serves as a fundamental control parameter determining the visual quality and bitrate of the video.
In the proposed method, the utilization of models dependent on the QP is a critical aspect contributing to the generation of high-quality outputs.During the training process, we generated five bitstreams and reconstructed YUV files, depending on the QP value, in accordance with the experiments carried out by the JVET CTC [31].Subsequently, the proposed model was trained using these five bitstreams and reconstructed YUV files independently.Additionally, we generated models based on different configurations as well.Acknowledging that error characteristics vary with the QP and configuration, we developed a strategy to customize each model to specific error characteristics.The separate models for each scenario in the proposed approach are to ensure that the model is tailored to the specific requirements of each scenario.For instance, the AI model is trained for image-centric applications, while the LD model is designed for real-time broadcasting and streaming purposes.More detailed information on the training process is available in Table 1.

Experimental Results and Discussion
To evaluate the effectiveness of the proposed method, the JVET CTC [37] sequences were chosen for evaluation but were not included in the training dataset.These 19 sequences were classified into classes A1, A2, B, C, and D based on their resolution with characteristics.The test QP values for all configurations were 22, 27, 32, 37, and 42, corresponding to the JVET CTC.Given the potential applications for the EVC Baseline profile with CNN-based Sensors 2024, 24, 1336 9 of 15 post-filtering, which could be used for a low-complexity, low-power sense node for video data transmission, we evaluated the proposed method using AI and LD configurations.Table 2 contains detailed test sequences and conditions.

Objective Testing Result
For the objective evaluation, the increase in the PSNR was measured at the same bitrate of each sequence over all the QP values (i.e., BD-PSNR) [37], and the BD-BR was also measured to check on the bitrate reduction at the same visual quality that is usually used in standard experiments.Tables 3 and 4 present the experimental results of the proposed method for the AI and LD configurations, comparing the decoded image of the EVC Baseline profile to the enhanced image filtered by the CNN-based post-filter.The numbers in the table represent the average bitrate and PSNR of the five QPs of each sequence in the reference and proposed method.
Those metrics, the BD-PSNR and BD-BR, compare the improvement in the PSNR and coding efficiency of different video codecs or encoding settings while taking into consideration both the bitrate and video quality.The fundamental concept involved fitting a cubic polynomial curve through five data points and subsequently deriving an expression for the integral of the curve.The BD-PSNR allows for an objective assessment of PSNR improvement by calculating the difference in the PSNR to achieve a comparable bitrate between two codecs.In the BD-PSNR, a higher number indicates an improvement in the PSNR over the anchor.Similarly, by measuring the difference in the bitrate needed to attain an equivalent quality level between two distinct codecs, the BD-BR metric facilitates an objective assessment of the compression efficiency.The lower BD-BR value signifies a higher coding efficiency than the anchor at the same visual quality.
Table 3 shows the results of the proposed method compared to the reference in the AI configuration.As shown in Table 3, the proposed method increases the PSNR at the same bitrate by approximately 0.57 dB, 0.75 dB, and 0.95 dB for the Luma and Chroma components in the AI configuration when compared to the post-filter pre-processed video.The increased PSNR results in overall BD-BR reductions of 11.62%, 24.5%, and 28.79% for the Luma and Chroma components, respectively, in the AI.
Table 4 shows the results of the proposed method compared to the reference in the LD configuration.Similar results can be observed in the LD configuration.As shown in the table, the proposed method increases the PSNR at the same bitrate by approximately 0.37 dB, 0.82 dB, and 0.95 dB for the Luma and Chroma components in the LD configuration when compared to the post-filter pre-process video.The improved PSNR results in overall BD-BR reductions of 10.91%, 31.22%, and 32.30% for the Luma and Chroma components, respectively, in the LD.The objective test results show that the proposed method significantly improved the visual quality of the decoded images of the EVC Baseline profile, regardless of the QP or configuration.Notably, the proposed method outperforms the AI configuration in terms of a low resolution and high QP values.The main reason for the significant improvements can be attributed to areas where blocking artifacts noticeably appear.Blocks in the AI configuration typically determine the number of coding blocks with a small coding block size; increasing the number of coding blocks results in more blocking artifacts in the block boundary area.This phenomenon can also be applied to the output of low-resolution sequences.Because of the high amount of quantized values for en/decoding, the artifact can be widely visible in coding blocks regardless of the coding block size or number.As a result, the improvements at high QP sequences would be due to an improvement in visual quality across the entire decoded image.

Subjective Testing Result
To assess the improvement in visual quality achieved by our proposed method, individual visual quality evaluations were conducted.Figures 5 and 6 present a comparative analysis of the visual quality for the AI and LD configurations between the decoded image of the EVC Baseline profile and the proposed results.The visual quality assessment was performed at a middle QP value as QP = 32.Figure 5 illustrates the comparison results for PartyScene with the AI configuration.The filtered image by post-filtering reveals a superior visual quality compared to the video before post-filtering, aligning with the 0.49 dB improvement observed in the objective evaluation.Notably, the figure of the proposed method in Figure 5 shows a further reduction in artifacts, especially in the face of the child and around the area of the boxes.Similar results are evident in the LD configuration. Figure 6 shows the comparison results for BQTerrace with the LD configuration.In this figure, the filtered image by the proposed method reveals more textural detail than the reference, especially notable in the parasol, where texture lines are clearly observed in the proposed filtered image.

Subjective Testing Result
To assess the improvement in visual quality achieved by our proposed method, individual visual quality evaluations were conducted.Figures 5 and 6 present a comparative analysis of the visual quality for the AI and LD configurations between the decoded image of the EVC Baseline profile and the proposed results.The visual quality assessment was performed at a middle QP value as QP = 32.Figure 5

Discussion
The experimental results of the proposed method showed both objective and subjective improvements in performance.In the objective results, the PSNR showed a significant enhancement in both the AI and LD configurations.This PSNR improvement corresponds to a reduced bitrate at the same image quality.Considering the historical fact of video coding standards improving performance by about 50% every decade, achieving a 10% enhancement with just one tool is quite impressive objectively.The subjective experimental results also reveal a remarkable improvement in image quality.The proposed method effectively addresses blocking artifacts, a specific target of enhancement, noticeably eliminating them.Another noteworthy aspect in the subjective experiments is the preservation of details in the original video that typically disappear due to coding artifacts.The proposed method successfully restores these lost details, bringing the visual quality closer to the original video.The significant improvement in visual quality is attributed to the CNN-based post-filter guided by the partitioning map, which identifies areas affected by blocking artifacts and guides the CNN model to enhance these areas in line with the original video.
Nevertheless, it should be noted that implementing the proposed CNN-based method involves considerable complexity, particularly when applied directly to devices within the IoT.The decision to employ a post-filter for enhancing image quality, accounting for diverse computing performances, aligns with the practical challenge of deploying CNN-based filters across the spectrum of IoT devices.In this context, it is believed to be more pragmatic to adaptively apply these filters as add-ons using post-filtering instead of in-loop filtering when external resources are available, as illustrated in the proposed configuration.

Future Work
This paper introduced a CNN-based post-filtering method designed for the EVC Baseline profile to address the requirements of IoT devices.While the proposed method is specifically applied to the EVC Baseline profile, its foundational architecture, shared with other video coding standards, suggests its potential applicability in the EVC Main profile or diverse video codecs.The exploration of extending the method outlined in this paper to other video codecs is considered an interesting topic for future research.
Additionally, it is noted that CNN remains an important role in enhancing the coding performance in this paper.With NNVC in the JVET successfully employing CNN-based deep-learning technology, a similar investigation into CNN-based filtering technology for the EVC Baseline profile has been conducted in the proposed method.Given the novelty of the MPEG-5 EVC Baseline profile and the limited research on leveraging CNN as a postfilter for this codec, exploring this aspect in conjunction with the latest machine-learning techniques is also considered an interesting topic for future research.

Conclusions
In this paper, a post-filter utilizing a CNN with block partition information for the EVC Baseline profile was proposed.As the demand for efficient video coding technology intensifies, driven by the surge in video data from IoT devices, the EVC Baseline can be considered as a promising solution designed to address the specific requirements of processing IoT video data with low complexity.Nevertheless, enhancements are required to address coding artifacts within the EVC Baseline profile.To tackle this issue, a postfilter utilizing a CNN based on block partition information was introduced in this paper.Through experimental results, both objective and subjective assessments showed significant improvements in both the AI and LD configurations when compared to the pre-post-filter video.The advancements achieved by the proposed method notably enhanced the visual quality, especially in blocking artifacts at the boundaries.Thus, this proposed method is expected to benefit networks of high-performance, low-complexity sensor nodes in IoT ecosystems using the EVC Baseline profile.

Figure 2 .
Figure 2. Example of coding artifacts detected in the area of the block boundary encoded with the EVC Baseline profile at the RaceHorses sequence with QP = 37.

Figure 3
Figure 3 depicts the overall pipeline for applying the proposed filtering in this paper.As depicted in the figure, in the case of this proposed post-filter, a filtering process is performed in the out-loop with the decoded image of the EVC Baseline profile.The CNNbased post-filter takes the decoded image and the block partitioning information extracted during the decoding process as the input and then improves the image quality by passing it through the trained CNN model.

Figure 2 .
Figure 2. Example of coding artifacts detected in the area of the block boundary encoded with the EVC Baseline profile at the RaceHorses sequence with QP = 37.

Figure 3
Figure 3 depicts the overall pipeline for applying the proposed filtering in this paper.As depicted in the figure, in the case of this proposed post-filter, a filtering process is performed in the out-loop with the decoded image of the EVC Baseline profile.The CNNbased post-filter takes the decoded image and the block partitioning information extracted during the decoding process as the input and then improves the image quality by passing it through the trained CNN model.Sensors 2024, 24, x FOR PEER REVIEW 7 of 16

Figure 3 .
Figure 3.The overall pipeline for applying the proposed post-filtering in the use case.

Figure 3 .
Figure 3.The overall pipeline for applying the proposed post-filtering in the use case.

Figure 4 .
Figure 4. Proposed CNN-based post-filtering with block partitioning information.

Figure 4 .
Figure 4. Proposed CNN-based post-filtering with block partitioning information.

Funding:
This work was supported by a grant from Kyung Hee University in 2023 (KHU-20230874) and was supported by the Institute of Information & communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (RS-2023-00220204, Development of Standard Technologies on next-generation media compression for transmission and storage of Metaverse contents).Institutional Review Board Statement: Not applicable.Informed Consent Statement: Not applicable.

Table 1 .
Details of the training environment.

Table 2 .
Details of the testing environment.

Table 3 .
Objective testing result of AI configuration.

Table 4 .
Objective testing result of LD configuration.